Beyond Word N-Grams
Abstract
We describe, analyze, and experimentally evaluate a new probabilistic model for word-sequence prediction in natural languages, based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mixtures have provably and practically better performance than almost any single model. Finally, we evaluate the model on several corpora. The low perplexity achieved by relatively small PST mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models.

1 Introduction

Finite-state methods for the statistical prediction of word sequences in natural language have played an important role in language-processing research since Markov's and Shannon's pioneering investigations (Shannon, 1951). While it has always been clear that natural texts are not Markov processes of any finite order (Good, 1969), because of very long-range correlations between words in a text, such as those arising from subject matter, low-order alphabetic n-gram models have been used very effectively for tasks such as statistical language identification and spelling correction, and low-order word n-gram models have been the tool of choice for language modeling in speech recognition. However, low-order n-gram models fail to capture even relatively local dependencies that exceed the model order, for instance those created by long but frequent compound names or technical terms. Unfortunately, extending the model order to accommodate those longer dependencies is not practical, since the size of an n-gram model grows exponentially with its order.

Recently, several methods have been proposed (Ron et al., 1994; Willems et al., 1994) that are able to model longer-range regularities over small alphabets while avoiding the size explosion caused by model order. In those models, the length of the context used to predict a particular symbol is adaptively extended as long as the extension improves prediction above a given threshold. The key ingredient of the model construction is the prediction suffix tree (PST), whose nodes represent suffixes of past input and specify a predictive distribution over possible successors of the suffix. It was shown in (Ron et al., 1994) that under realistic conditions a PST is equivalent to a Markov process of variable order and can be represented efficiently by a probabilistic finite-state automaton. For the purposes of this paper, however, we will use PSTs as our starting point.

The problem of sequence prediction appears more difficult when the sequence elements are words rather than characters from a small fixed alphabet. The set of words is in principle unbounded, since ...
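To make the PST construction concrete, the following minimal Python sketch implements a word-level prediction suffix tree: each node corresponds to a suffix of the recent context (read backwards from the most recent word) and stores counts of the words observed to follow that suffix, and prediction descends to the deepest matching node. The class names, the fixed maximum depth, and the Laplace smoothing are illustrative assumptions, not the paper's construction, which extends a context only when doing so improves prediction above a threshold.

    # Minimal word-level PST sketch (illustrative, not the paper's algorithm).
    from collections import defaultdict

    class PSTNode:
        def __init__(self):
            self.children = {}              # one child per context word extending the suffix
            self.counts = defaultdict(int)  # counts of successor words at this suffix
            self.total = 0

    class PST:
        def __init__(self, max_depth=5):
            self.root = PSTNode()
            self.max_depth = max_depth

        def update(self, context, word):
            # Record that `word` followed `context` (a list of words, most
            # recent last), updating every suffix node down to max_depth.
            node = self.root
            node.counts[word] += 1
            node.total += 1
            for w in reversed(context[-self.max_depth:]):
                node = node.children.setdefault(w, PSTNode())
                node.counts[word] += 1
                node.total += 1

        def predict(self, context, word, vocab_size):
            # Estimate P(word | context) at the deepest suffix node matching
            # `context`; Laplace smoothing keeps unseen successors nonzero.
            node = self.root
            for w in reversed(context[-self.max_depth:]):
                if w not in node.children:
                    break
                node = node.children[w]
            return (node.counts[word] + 1) / (node.total + vocab_size)

After training with pst.update(words[:i], words[i]) for each position i of a text, pst.predict(["on", "the"], "mat", vocab_size) returns the smoothed probability of "mat" at the deepest stored context matching "... on the".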
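The tree-mixture idea can be sketched in the same setting: instead of trusting only the deepest matching node, blend the predictions of every node on the suffix path, with per-node weights updated online according to how well each node has been predicting. The multiplicative update below is an exponential-weights stand-in for the paper's recursive-prior recursion, and mixture_predict builds on the hypothetical PST sketch above.

    def mixture_predict(pst, context, word, vocab_size):
        # Collect the path of matching suffix nodes, shortest context first.
        path = [pst.root]
        node = pst.root
        for w in reversed(context[-pst.max_depth:]):
            if w not in node.children:
                break
            node = node.children[w]
            path.append(node)
        # Each node lazily carries a mixture weight, initialized uniformly.
        for n in path:
            if not hasattr(n, "weight"):
                n.weight = 1.0
        preds = [(n.counts[word] + 1) / (n.total + vocab_size) for n in path]
        total = sum(n.weight for n in path)
        p = sum(n.weight * q for n, q in zip(path, preds)) / total
        # Nodes that assigned `word` higher probability gain relative weight,
        # so the mixture adapts toward the most predictive context lengths.
        # (Renormalization and underflow handling are omitted for brevity.)
        for n, q in zip(path, preds):
            n.weight *= q
        return p

A mixture of this general form tracks the best single pruning of the tree, which is the sense in which the abstract's claim that the mixtures beat almost any single model should be read.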
Similar resources
Lexical-semantic resources: yet powerful resources for automatic personality classification
In this paper, we aim to reveal the impact of lexical-semantic resources, used in particular for word-sense disambiguation and sense-level semantic categorization, on the automatic personality classification task. While stylistic features (e.g., part-of-speech counts) have shown their power in this task, the impact of semantics beyond targeted word lists is relatively unexplored. We propose an...
Phonetic speaker recognition using maximum-likelihood binary-decision tree models
Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. This paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particu...
Beyond Word N-Grams (cmp-lg/9607016 v1, 13 Jul 1996)
We describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mi...
Multi-class composite n-gram language model using multiple word clusters and word successions
In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data-sparseness problem that arises with small amounts of training data. The Multi-Class Composite N-gram maintains accurate word prediction and reliability for sparse data with a compact model size, based on multiple word clusters called Multi-Classes. Within a Multi-Class, the statistical connectivity...
Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters
In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data-sparseness problem for spoken language, for which it is difficult to collect sufficient training data. The Multi-Class Composite N-gram maintains accurate word prediction and reliability for sparse data with a compact model size, based on multiple word clusters called Multi-Classes. In the Multi-Cl...
Comparing word, character, and phoneme n-grams for subjective utterance recognition
In this paper, we compare the performance of classifiers trained using word n-grams, character n-grams, and phoneme n-grams for recognizing subjective utterances in multiparty conversation. We show that there is value in using very shallow linguistic representations, such as character n-grams, for recognizing subjective utterances, in particular, gains in the recall of subjective utterances.
Journal: CoRR
Volume: cmp-lg/9607016
Pages: -
Publication date: 1995